# Managing Power and Reliability of Integrated Systems

Tajana Šimunić Rosing UCSD

# Introduction

#### Systems on Chip (SOCs)

Designed as a tightly interconnected set of cores

#### Deterministic design paradigm is increasingly harder to implement

- Parameter spread in deep submicron technologies, increasing system complexity, reduced noise immunity, power and thermal management issues

#### Networks on Chips (NOCs, also known as Multi-Processor SOCs)

- Treat SOCs as micro-networks in order to leverage uncertainty -> assume incomplete knowledge of the environment
- Interconnects designed using networking concepts
  - Multiple concurrent connections -> higher bandwidth
  - Regular structure -> optimized global wire design, modularity
  - Error correction/control -> leverage ARQ, FEC etc. from networking
  - Traffic scheduling -> lower latency
- Big issue is a tradeoff between power, performance and reliability
  - Lower power consumption -> lower temperature, better reliability
  - But frequent switching between power states can cause a significant decrease in reliability -> careful optimization is needed!

## **Network on a Chip**



| Specifications       | Audio | Video | Speech | Comm. | Total |
|----------------------|-------|-------|--------|-------|-------|
| Active (mW)          | 700   | 1885  | 1500   | 1055  | 5140  |
| Idle (mW)            | 216   | 235   | 1000   | 208   | 1659  |
| Sleep (mW)           | 0.3   | 1.4   | 100    | 0.6   | 102.2 |
| A-S-A (ms)           | 45.6  | 54.6  | 40     | 54.6  | 54.6  |
| # DVS settings       | 11    | 11    | 3      | 11    | 11    |
| DVS switch (µs)      | 150   | 150   | 100    | 150   | 150   |


# **Power Manager Implementation**



- Power management
  - Node-centric: fully contained in a local power manager
  - Network-centric: network power management requests
- Local power manager implements closed-loop power management:
  - Estimator
    - Observes incoming core traffic, core state & network PM requests
    - Estimates parameters used in recalculation of power management policy
  - Controller
    - Sets core's energy and performance states based on estimator input


# **Node-centric PM**

### Power management is based on Renewal Model



## **Renewal Time Formulation**

### Define:

- $T$: renewal time
- $T_a$: time of first request arrival
- $jh$: time of transition to sleep from idle state ($j$ = index, $h$ = time increment)

$$E[T] = E[T | T_a < jh] + E[T | T_a > jh]$$

- First term (request arrives before the transition to sleep): E[Length of idle period] + E[Time to service request]
- Second term (request arrives after the transition to sleep): E[Length of idle period] + E[Time to sleep] + E[Length of sleep] + E[Time to active] + E[Time to service all requests]

## **Energy & Performance Formulation**

- Energy and performance are calculated for each state using:
  - constant C<sub>i</sub> (e.g. power consumption in state i)
  - expected time spent in the state, E[Ti]
  - probability of the first request arrival, P(Ta)



# **Renewal Policy Optimization**

#### Basic assumptions:

- general distribution governs the first request arrival
- exponential distribution represents arrivals after the first arrival
- user, device and queue are stationary
- Optimize average performance under average power constraint -> randomized policy

$$\min \quad \frac{\sum_{j} d(j) p(j)}{\sum_{j} T(j) p(j)}$$

s.t.

$$\sum_{j} (Energy(j) - P_{constr} T(j)) p(j) = 0$$

$$\sum_{j} p(j) = 1$$

Globally optimal policy calculated in seconds using LP
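A minimal Python sketch of the optimization above, under stated assumptions. The slide solves the problem with LP; this sketch swaps in plain vertex enumeration, which gives the same optimum here because the two equality constraints leave at most two nonzero p(j) at any vertex, and a linear-fractional objective with a positive denominator attains its minimum at a vertex. The function name `optimize_renewal_policy` and all sample numbers are hypothetical.

```python
from itertools import combinations


def optimize_renewal_policy(d, T, E, p_constr, tol=1e-12):
    """Minimize sum(d*p)/sum(T*p) s.t. sum((E - p_constr*T)*p) = 0, sum(p) = 1, p >= 0.

    d, T, E are per-index delay, time and energy values (assumed inputs).
    Returns (objective, p) or None if the constraints cannot be met.
    Assumes every T[j] > 0 so the denominator stays positive.
    """
    n = len(d)
    g = [E[j] - p_constr * T[j] for j in range(n)]  # power-constraint coefficients

    candidates = []
    # Vertices with a single nonzero entry: the constraint must hold exactly.
    for j in range(n):
        if abs(g[j]) <= tol:
            candidates.append({j: 1.0})
    # Vertices with two nonzero entries: mix one index above and one below the budget.
    for a, b in combinations(range(n), 2):
        if g[a] * g[b] < 0:
            pa = g[b] / (g[b] - g[a])  # solves pa*g[a] + (1 - pa)*g[b] = 0
            candidates.append({a: pa, b: 1.0 - pa})

    best = None
    for cand in candidates:
        num = sum(d[j] * w for j, w in cand.items())
        den = sum(T[j] * w for j, w in cand.items())
        obj = num / den
        if best is None or obj < best[0]:
            best = (obj, [cand.get(j, 0.0) for j in range(n)])
    return best


if __name__ == "__main__":
    # Hypothetical numbers: three candidate sleep times with delay, time and energy.
    print(optimize_renewal_policy([0.0, 0.5, 2.0], [1.0, 1.0, 1.0], [3.0, 2.0, 1.0], 1.5))
```

The result is a randomized policy: the power budget is met on average by mixing two sleep times rather than by any single one.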

### **Closed-loop Renewal Policy Optimization**

Formulate dual of the Lagrangian

- Variables $v$, $u$ and $\lambda$ are the Lagrangian multipliers

$$\min v \quad \text{s.t.} \quad d(j) + v(j)t(j) + u(j)[e(j) - t(j)P_{constr}] - \lambda(j) = 0 \quad \forall j$$

Obtain a minimum crossing point of a set of lines specified by the following equation:

$$w(j) \le \frac{d(j)}{t(j)} + u(j) \frac{[e(j) - t(j)P_{constr}]}{t(j)}$$

 Indexes of Lagrangian multipliers which form a solution, together with original constraints, are used to obtain the probabilities of transitioning into sleep state

Real-time closed-loop control is possible

Globally optimal policy calculated in milliseconds

# **Node-centric estimation**

Estimation of exponential and Pareto distribution parameters

**Exponential** 

$$Exp = 1 - e^{-\lambda_e t}$$

- calculate maximum likelihood ratio for all rate settings
- calculate interarrival (or interservice) time sums (Σt<sub>i</sub>)
- evaluate natural log of maximum likelihood ratio, ln (P<sub>max</sub>)

$$\ln(P_{\max}) = k \ln \frac{\lambda_{new}}{\lambda_{old}} - (\lambda_{new} - \lambda_{old}) \sum_{j=n_{change}}^{n_{points}} t_j$$

- if the ratio is larger than the one obtained from the lookup table, assume that the rate has changed

**Pareto**

$$Pareto = 1 - b \cdot t^{-a}$$

- estimate parameters a & b using the least-squares method on the log of the Pareto distribution
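The two estimators above could be sketched as follows in Python. This is an illustration under assumptions, not the implementation behind the slides: `k` is taken to be the number of samples since the suspected change point, the lookup-table threshold is passed in as a plain number, and the Pareto fit regresses ln(1 - F) on ln(t), which follows from 1 - F = b t^(-a). All function names are hypothetical.

```python
import math


def log_likelihood_ratio(interarrival_times, lam_old, lam_new):
    """ln(P_max) for a switch from rate lam_old to lam_new over the given samples."""
    k = len(interarrival_times)  # assumption: samples from n_change to n_points
    return k * math.log(lam_new / lam_old) - (lam_new - lam_old) * sum(interarrival_times)


def rate_has_changed(interarrival_times, lam_old, lam_new, threshold):
    """Compare the log ratio with a threshold (the slide reads it from a lookup table)."""
    return log_likelihood_ratio(interarrival_times, lam_old, lam_new) > threshold


def fit_pareto(times, cdf_values):
    """Least-squares fit of ln(1 - F) = ln(b) - a*ln(t); returns (a, b)."""
    xs = [math.log(t) for t in times]
    ys = [math.log(1.0 - f) for f in cdf_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)


if __name__ == "__main__":
    # Short interarrival times favor the higher rate, long ones do not.
    print(rate_has_changed([0.1] * 10, lam_old=1.0, lam_new=10.0, threshold=0.0))
    print(rate_has_changed([1.0] * 10, lam_old=1.0, lam_new=10.0, threshold=0.0))
    ts = [2.0, 4.0, 8.0, 16.0]
    print(fit_pareto(ts, [1.0 - 2.0 * t ** -1.5 for t in ts]))
```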



# **Controller implementation**

- Consists of LFSR for generating probability & policy logic
- Controller on entry to idle state:
  - obtains a random number RND & finds jh for which RND>p(jh)
  - if no arrival during jh seconds, the core enters sleep state, otherwise it stays active
- Frequency and voltage are set so that the average expected processing delay in the queue is kept constant




#### **Optimal Policy**

| Idle time jh (ms) | Probability to sleep p(jh) |
|-------------------|----------------------------|
| 0                 | 0.00                       |
| 10                | 0.00                       |
| 20                | 0.12                       |
| 30                | 0.43                       |
| 40                | 0.75                       |
| 50                | 0.87                       |
| 60                | 0.91                       |
| 70                | 1.00                       |
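The controller steps and the Optimal Policy table can be illustrated with a short Python sketch. Several things here are assumptions rather than facts from the slides: the table is read as a cumulative probability of having gone to sleep by idle time jh, the timeout is chosen as the first jh whose p(jh) reaches the random number, and the LFSR is a 15-bit Fibonacci register with the common PRBS15 polynomial x^15 + x^14 + 1. The slides do not give the actual taps or the exact comparison convention.

```python
# (idle time jh in ms, probability to sleep p(jh)) from the Optimal Policy table
POLICY = [(0, 0.00), (10, 0.00), (20, 0.12), (30, 0.43),
          (40, 0.75), (50, 0.87), (60, 0.91), (70, 1.00)]


def lfsr15_step(state):
    """One step of a 15-bit Fibonacci LFSR (x^15 + x^14 + 1); state must be nonzero."""
    bit = ((state >> 14) ^ (state >> 13)) & 1
    return ((state << 1) | bit) & 0x7FFF


def select_timeout(rnd, policy=POLICY):
    """Return the first idle time jh whose cumulative sleep probability reaches rnd."""
    for jh, p in policy:
        if p > 0.0 and rnd <= p:
            return jh
    return policy[-1][0]


def next_timeout(state, policy=POLICY):
    """Advance the LFSR and map its value to a sleep timeout in ms."""
    state = lfsr15_step(state)
    rnd = state / 0x7FFF  # nonzero state, so rnd lies in (0, 1]
    return state, select_timeout(rnd, policy)


if __name__ == "__main__":
    state = 1
    for _ in range(5):
        state, timeout_ms = next_timeout(state)
        print(state, timeout_ms)
```

On entry to the idle state the core would arm a timer with the selected timeout and enter sleep only if no request arrives before it expires.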

#### **FPGA synthesis**

| LFSR bits | LFSR regs: # LABs | LFSR regs: max ns | Policy: # LABs | Policy: max ns |
|-----------|-------------------|-------------------|----------------|----------------|
| 5-15      | 1                 | 4                 | 2              | 35             |

#### **Synopsys synthesis**

| LFSR regs: # FFs | LFSR regs: % area | Policy: # gates | Policy: % area |
|------------------|-------------------|-----------------|----------------|
| 5                | 14%               | 193             | 86%            |
| 9                | 14%               | 417             | 86%            |
| 15               | 12%               | 855             | 87%            |

# **Network centric PM**




# **Network centric PM implementation**

- Estimator continues to have the same function as before
- Renewal model is expanded to include network requests
- Controller implementation changes:
  - When all network cores release the local core, the probability of transition to sleep is 1.0

| Source  | Idle time (ms) | Transition probability |
|---------|----------------|------------------------|
| Node    | 0              | 0                      |
|         | 70             | 0.3                    |
|         | 120            | 1                      |
| Network | Any time       | 1                      |

As soon as a request comes from a network core to the local core, the local core transitions to the active state with probability 1.0

Node-centric PM is still needed to implement DVS and PM in situations when early network requests are not available
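A small Python sketch of how the node and network rows of the table above might combine. It assumes the node-centric probabilities act as a step function between the listed idle times, which the slide does not state; `sleep_probability` and `NODE_POLICY` are hypothetical names.

```python
# (idle time in ms, transition probability) for the node-centric rows of the table
NODE_POLICY = [(0, 0.0), (70, 0.3), (120, 1.0)]


def sleep_probability(idle_ms, network_released, node_policy=NODE_POLICY):
    """Probability of transitioning to sleep after idle_ms of idleness.

    network_released is True once all network cores have released the local core;
    in that case the transition happens with probability 1.0 at any time.
    """
    if network_released:
        return 1.0
    prob = 0.0
    for threshold_ms, p in node_policy:
        if idle_ms >= threshold_ms:  # assumed step function between table entries
            prob = p
    return prob


if __name__ == "__main__":
    print(sleep_probability(80, network_released=False))
    print(sleep_probability(5, network_released=True))
```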

## **Network-centric results**

| Specifications       | Audio | Video | Speech | Comm. | Total |
|----------------------|-------|-------|--------|-------|-------|
| Active (mW)          | 700   | 1885  | 1500   | 1055  | 5140  |
| Idle (mW)            | 216   | 235   | 1000   | 208   | 1659  |
| Sleep (mW)           | 0.3   | 1.4   | 100    | 0.6   | 102.2 |
| A-S-A (ms)           | 45.6  | 54.6  | 40     | 54.6  | 54.6  |
| # DVS settings       | 11    | 11    | 3      | 11    | 11    |
| DVS switch (µs)      | 150   | 150   | 100    | 150   | 150   |



#### **Power savings factor**

| PM              | PM Type   | Audio | Video | Comm. | Speech | Total |
|-----------------|-----------|-------|-------|-------|--------|-------|
| None            | None      | 1     | 1     | 1     | 1      | 1     |
| Node-centric    | DVS only  | 1.4   | 2     | 1     | 3.9    | 1.2   |
|                 | DPM only  | 2     | 1.5   | 3     | 2      | 2.4   |
|                 | DVS & DPM | 2.9   | 2.9   | 3     | 5.8    | 2.9   |
| Network-centric | DVS & DPM | 3.7   | 3.6   | 4.2   | 6.4    | 4.1   |

Network-centric DPM increases power savings from a factor of 2.9 to 4.1, while at the same time reducing performance penalty by more than 10%

# Joint Power and Reliability Management

### **Integrated System Technology Issues**

- Extremely small size
  - Thinner interconnect -> more chance of EM failure
  - Thinner dielectric -> more chance of TDDB failure
  - Narrower design margins
- Extremely large scale
  - High transistor density
    - Causes more failures
    - Enables redundancy
- Energy consumption
  - Increased energy consumption is a hurdle to modular redundancy
  - Power and thermal management are critical
    - Reliability exponentially related to temperature

Designing reliable integrated systems requires techniques that integrate with power management and tie to the underlying technology




# Hard errors

- Defects in silicon or package, permanent once present
- Integrated system lifetime is inversely proportional to the hard error rate
  - Extrinsic
    - caused by process and manufacturing defects
    - usually screened out before shipping a product
  - Intrinsic
    - occur during operation
    - depend on materials used, process parameters, system design and operating conditions
    - should occur after device passes its useful lifetime
    - Examples: electromigration, time dependent dielectric breakdown, thermal cycling

# **Electromigration (EM)**

- Result of momentum transfer from electrons to the ions that make up the interconnect lattice
- Leads to opening of metal lines/contacts, shorting between adjacent metal lines, shorting between metal levels, increased resistance of metal lines/contacts or junction shorting
- Described by Black's model, where A<sub>o</sub> is an empirically determined constant, J is the current density in the interconnect, J<sub>crit</sub> is the threshold current density, k is Boltzmann's constant, T is the temperature, E<sub>a</sub> = 0.7 and n = 2

$$MTTF_{EM} = A_o (J - J_{crit})^{-n} e^{\frac{Ea}{kT}}$$

Failure rate due to EM is modeled only in the active and idle states, because in the sleep state the leakage current is not yet large enough to cause migration:

$$\lambda_{core,s}^{EM} = A_o' (J_s - J_{crit})^n e^{\frac{-Ea}{kT_s}} \quad \forall s = active, idle$$


### Time Dependent Dielectric Breakdown (TDDB)

- Wear-out mechanism of the dielectric due to electric field and temperature; causes formation of conductive paths through dielectrics, shorting the anode and cathode
- MTTF is a function of the empirically determined constant  $A_o$ , the field acceleration parameter  $\gamma$ , the electric field across the dielectric  $E_{ox}$ , the activation energy  $E_a$  and temperature T

$$MTTF_{TDDB} = A_o e^{-\gamma E_{ox}} e^{\frac{Ea}{kT}}$$

Failure rate due to TDDB:

$$\lambda_{core,s}^{TDDB} = A_o' e^{\gamma E_{ox,s}} e^{\frac{-Ea}{kT_s}} \quad \forall s = active, idle, sleep$$
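The EM and TDDB MTTF models above can be explored numerically with a short Python sketch. Since the empirical constant A_o is not given, it defaults to 1 here, so only ratios between operating points are meaningful. The sketch also assumes that the activation energy of 0.7 is in eV, and it uses placeholder values for the TDDB parameters (gamma and E_a), which the slides do not specify. Function names are hypothetical.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann's constant in eV/K


def em_mttf(j, j_crit, temp_k, a_o=1.0, n=2.0, e_a=0.7):
    """Black's model: A_o * (J - J_crit)^(-n) * exp(E_a / (k*T)); requires J > J_crit."""
    return a_o * (j - j_crit) ** (-n) * math.exp(e_a / (K_BOLTZMANN_EV * temp_k))


def tddb_mttf(e_ox, temp_k, a_o=1.0, gamma=1.0, e_a=0.7):
    """TDDB model: A_o * exp(-gamma*E_ox) * exp(E_a / (k*T)); gamma and e_a are placeholders."""
    return a_o * math.exp(-gamma * e_ox) * math.exp(e_a / (K_BOLTZMANN_EV * temp_k))


if __name__ == "__main__":
    # Relative EM lifetime when the die cools from 360 K to 350 K at fixed current density.
    print(em_mttf(2.0, 1.0, 350.0) / em_mttf(2.0, 1.0, 360.0))
    # Relative TDDB lifetime when the oxide field drops by one (arbitrary) unit.
    print(tddb_mttf(4.0, 350.0) / tddb_mttf(5.0, 350.0))
```

The matching failure rates on the slides are the reciprocals of these expressions up to the constant, which is why the signs of the exponents flip.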

# **Temperature Cycling (TC)**

- Caused by thermal cycles that occur during power state changes
  - Slow and fast thermal cycles
- Induces plastic deformations in materials -> leads to cracks, short circuits and other failures of metal films and interlayer dielectrics
- Depends on temperature range and average temperature:

$$N_{f} = C_{o} \left[ C_{1} \left( T_{\max} - T_{\min} \right) - C_{2} \left( T_{avg} - T_{mold} \right) \right]^{-q}$$

Failure rate due to TC:

$$\lambda_{core,s}^{TC} = C_o \left[ \left( T_{active} - T_s \right) - \left( T_{avg,s} - T_{mold} \right) \right]^{-q} t^{-1} \quad \forall s = sleep$$

# **Basic Reliability Configurations**

- Active parallel configuration has all redundant components working concurrently
  - Energy consumption is high
  - Time to transition on failure is very low
  - Failure rate is higher than standby parallel
  - E.g. identical controllers for a nuclear reactor
- Standby parallel configuration has redundant components in low-power mode until failure of the active component
  - Energy consumption lower
  - Time to transition on failure higher
  - Low failure rate
  - E.g. dual CPU platform
- Series combination has the highest failure rate
  - E.g. CPU, memory, interconnect



 $\lambda_{fap} = (\sum_{i=1}^{m} (-1)^{i-1} \frac{C_i^m}{i\lambda_x})^{-1}$ 
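A brief Python sketch of the active parallel formula above, reading $C_i^m$ as the binomial coefficient "m choose i" and $\lambda_x$ as the common failure rate of the m identical components. The series rule (failure rates add) is not written on the slide; it is the standard result for constant failure rates and is included here as an assumption for comparison. Function names are hypothetical.

```python
from math import comb


def active_parallel_rate(m, lam):
    """Effective failure rate of m identical components in active parallel."""
    mttf = sum((-1) ** (i - 1) * comb(m, i) / (i * lam) for i in range(1, m + 1))
    return 1.0 / mttf


def series_rate(rates):
    """Failure rate of components in series, assuming constant failure rates."""
    return sum(rates)


if __name__ == "__main__":
    lam = 1.0
    print(active_parallel_rate(1, lam))   # single component
    print(active_parallel_rate(2, lam))   # one redundant copy
    print(series_rate([lam, lam, lam]))   # e.g. CPU, memory, interconnect
```

Each added parallel copy lowers the effective rate, but with diminishing returns, while every series element raises it.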



## **Focus of this work**

### Analyze system-level reliability

- as a function of a power management policy
  - Analysis of single core and multiple core system
  - TC effect can dominate at small feature sizes, thus causing a large drop in reliability with aggressive power management policies

#### Determine a system management policy

- to maximize reliability and minimize energy consumption
- Stochastic optimization problem
- Time-indexed Markov chain model
  - Combined reliability with power management optimization

# System analysis

Analytical models are possible only under very strict assumptions on topology and failure rates

- Simulation of system evolution
  - For a given topology
  - For a given policy

Challenges:

- Stiffness of time constants
  - MTTF is much longer than system transition times and the arrival times of environment events

# **Simulator design**

- Event-driven stochastic simulator
- Accepts any workload distribution, including raw data

### Input:

- Reliability topology and failure mechanism characteristics (e.g. EM)
- Power state specification
- Workload
- Time horizon

### Output:

- MTTF, system reliability, energy consumption, performance

Runs in a few seconds for a system of 10 cores
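To give a feel for the kind of estimate such a simulator produces, here is a deliberately simplified Monte Carlo sketch in Python. It is not the simulator described above: it ignores power states, workload and the time horizon, and only samples component lifetimes for a series arrangement of active parallel blocks with constant failure rates. The topology encoding and function names are hypothetical.

```python
import random


def sample_system_lifetime(rng, topology):
    """Sample one system lifetime.

    topology is a list of blocks in series; each block is a list of failure
    rates for components in active parallel. A block fails when its last
    component fails; the system fails when its first block fails.
    """
    return min(max(rng.expovariate(lam) for lam in block) for block in topology)


def estimate_mttf(topology, runs=20000, seed=0):
    """Average sampled lifetime over a fixed number of runs."""
    rng = random.Random(seed)
    return sum(sample_system_lifetime(rng, topology) for _ in range(runs)) / runs


if __name__ == "__main__":
    print(estimate_mttf([[1.0, 1.0]]))    # two components in active parallel
    print(estimate_mttf([[1.0], [1.0]]))  # two components in series
```

For two identical components with unit failure rate the analytical MTTF values are 1.5 (active parallel) and 0.5 (series), which the estimates should approach as the number of runs grows.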

# **Simulation Results**

#### 95nm technology

- Video core becomes less reliable with lower temperatures as TC dominates; at higher temperatures EM and TDDB overpower TC so reliability improves
- Audio core reliability falls with power consumption as TC mechanism dominates at all temperatures





[Figure panels: Video, Audio]

| Core  | P<sub>active</sub> [W] | P<sub>idle</sub> [W] | P<sub>sleep</sub> [W] | t<sub>ts</sub> [ms] | t<sub>ta</sub> [ms] | $\lambda_{core}$ [s<sup>-1</sup>] | $\lambda_{workload}$ [s<sup>-1</sup>] |
|-------|------------------------|----------------------|-----------------------|---------------------|---------------------|-----------------------------------|---------------------------------------|
| video | 1.5                    | 1                    | 0.65                  | 40                  | 40                  | 100                               | 1                                     |
| audio | 0.7                    | 0.2                  | 3.00E-04              | 40                  | 40                  | 10                                | 0.1                                   |


# **DPM&DRM - Dependability modeling**



### Markov processes model memoryless systems with constant failure rates

## **DPM&DRM - Power management modeling**



# **TISMDP Model for Joint DPM & DRM**

(Time-Indexed Semi-Markov Decision Process Model)

- allows multiple decision states (e.g. multiple low-power states)
- more general and more complex method as compared to Renewal
- guarantees optimal results
- base model is Semi-Markov decision process model
  - applies to states with at least one exponential transition
- time-indexing is needed to account for time in states where more than one non-exponential transition occurs
- same basic assumptions as with Renewal model:
  - general distribution governs the first request arrival
  - exponential distribution represents arrivals after the first arrival
  - user, device and queue are stationary

# **DPM&DRM System Model Details**

- Combine:
  - Power-state machine model: TISMDP
  - Reliability model: Markov process
- Represent overall system as combination of components' PSMs where failure rates depend on system state
- System control aims to increase energy efficiency and enhance reliability



# **DPM&DRM Policy Optimization**

Minimize average energy consumed under reliability and performance constraints – get randomized policy

$$\min \sum_{c=1}^{N} \operatorname{cost}_{energy, c}$$

s.t.

$$\sum_{a \in A} f(s, a) - \sum_{a \in A} \sum_{s' \in S} M(s' | s, a) f(s', a) = 0; \quad \forall s, \forall c_s$$

$$\sum_{a \in A} \sum_{s \in S} T(s, a) f(s, a) = 1; \quad \forall c_s$$

$$\sum_{c=1}^{N} \operatorname{cost}_{perf, c} < \operatorname{Perf}_{const}; \quad \forall c$$

$$Tpl(\lambda_c) \leq \operatorname{Rel}_{const}; \quad \forall c_s$$

$$\lambda_c = \sum_{i \in F} \sum_{a \in A} \sum_{s \in S} \lambda_{core}^i (s, a) y(s, a) f(s, a)$$

Variable definitions:

| Symbol                      | Definition                                                    |
|-----------------------------|---------------------------------------------------------------|
| $\operatorname{cost}(s, a)$ | average cost incurred while in state s given action a         |
| $f(s, a)$                   | frequency of executing action a while in state s              |
| $M(s' \mid s, a)$           | probability of arriving to state s' given action a taken in state s |
| $T(s, a)$                   | expected time spent in state s given action a                 |
| $Tpl(\lambda_c)$            | reliability constraint as a function of network topology Tpl  |
| $\lambda_c$                 | core reliability                                              |

### Obtain globally optimal policy using linear programming

Policy is obtained from state-action frequencies f(s,a) in the form of a table of probabilities of issuing command a when the system is in state s:

$$p(s,a) = \frac{f(s,a)}{\sum_{a'} f(s,a')}$$
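The normalization from state-action frequencies to a policy table can be written in a few lines of Python. The dictionary layout and the sample frequencies are illustrative assumptions; in practice f(s,a) would come from the LP solution.

```python
def policy_from_frequencies(f):
    """Convert state-action frequencies f[(s, a)] into probabilities p[(s, a)].

    For each state s, p(s, a) = f(s, a) / sum over a' of f(s, a').
    States that are never visited (zero total frequency) get probability 0.
    """
    totals = {}
    for (s, _a), freq in f.items():
        totals[s] = totals.get(s, 0.0) + freq
    return {
        (s, a): (freq / totals[s] if totals[s] > 0 else 0.0)
        for (s, a), freq in f.items()
    }


if __name__ == "__main__":
    # Hypothetical frequencies for an idle state with two possible commands.
    f = {("idle", "sleep"): 0.1, ("idle", "continue"): 0.3, ("active", "continue"): 0.6}
    for key, prob in policy_from_frequencies(f).items():
        print(key, round(prob, 3))
```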


# **DPM Constraint Formulation**

### Energy and performance cost:

- $k(s_i, a_i)$: lump sum cost
- $c(s_{i+1}, s_i, a_i)$: cost rate (e.g. power or performance penalty)
- $F(t_i | s_i, a_i)$: probability distribution of next event occurrence
- $p(s_{i+1} | t_i, s_i, a_i)$: probability of transition into next state $s_{i+1}$

$$Cost(s_{i}, a_{i}) = \begin{cases} k(s_{i}, a_{i}) + \int_{0}^{\infty} \left[ F(du \mid s_{i}, a_{i}) \sum_{s_{i+1} \in S_{i+1}} \int_{0}^{u} c(s_{i+1}, s_{i}, a_{i}) p(s_{i+1} \mid t_{i}, s_{i}, a_{i}) dt \right] &\forall dt \\ k(s_{i}, a_{i}) + \sum_{s_{i+1} \in S_{i+1}} c(s_{i+1}, s_{i}, a_{i}) T(s_{i}, a_{i}) &\forall \Delta t \end{cases}$$

### Expected time spent in each state:

$$T(s_{i}, a_{i}) = \begin{cases} \int_{0}^{\infty} t \sum_{s_{i+1} \in S_{i+1}} p(s_{i+1} | t_{i}, s_{i}, a_{i}) F(dt | s_{i}, a_{i}) & \forall dt \\ \int_{0}^{t_{i} + \Delta t} \frac{(1 - F(t))dt}{1 - F(t_{i})} & \forall \Delta t \end{cases}$$

### Probability of arrival into each state:

$$M(s_{i+1} | s_i, a_i) = \begin{cases} \int_{0}^{\infty} p(s_{i+1} | t_i, s_i, a_i) F(dt | s_i, a_i) & dt \\ p(s_{i+1} | t_i, s_i, a_i) & \Delta t \end{cases}$$


## **Reliability Constraint Formulation**

- Failure rate of each state is a sum of the failure rates due to all mechanisms (EM, TDDB, TC) acting in that state
- Expected temperature in a state needs to be calculated

$$T_{state} = (T_{active} - T_{state,ss})e^{-\frac{y(s,a)}{\tau}} + T_{state,ss}$$
$$T_{active} \propto P_{active}(R_{th\,die} + R_{th\,package})$$

Total failure rate of a core is a weighted sum of state failure rates, for example:

- core has three power states: active, idle and sleep
- two actions: "go to sleep" (S) and "continue" (C)

$$\lambda_A y(A, C) f(A, C) + \lambda_I y(I, C) f(I, C) + \lambda_I y(I, S) f(I, S) + \lambda_S y(S, C) f(S, C) \le \operatorname{Rel}_{const}$$

System failure rate is calculated based on core topology as a function of series and parallel combinations
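The temperature relaxation and the weighted failure-rate sum from this slide can be checked with a small Python sketch. All numbers below (temperatures, time constant, rates, frequencies) are made up for illustration, and y(s,a) is treated as the expected time spent in the state, which is an assumption about the slide's notation.

```python
import math


def state_temperature(t_active, t_ss, y, tau):
    """T_state = (T_active - T_state,ss) * exp(-y / tau) + T_state,ss."""
    return (t_active - t_ss) * math.exp(-y / tau) + t_ss


def core_failure_rate(rates, y, f):
    """Sum of lambda_s * y(s, a) * f(s, a) over all state-action pairs in f."""
    return sum(rates[s] * y[(s, a)] * f[(s, a)] for (s, a) in f)


if __name__ == "__main__":
    # Core cooling from an active temperature toward the sleep steady state.
    print(state_temperature(t_active=360.0, t_ss=320.0, y=5.0, tau=10.0))

    # Three power states (A, I, S) and two actions: continue (C), go to sleep (S).
    rates = {"A": 2.0, "I": 1.0, "S": 0.5}
    f = {("A", "C"): 0.5, ("I", "C"): 0.2, ("I", "S"): 0.1, ("S", "C"): 0.2}
    y = {key: 1.0 for key in f}
    lam = core_failure_rate(rates, y, f)
    print(lam, lam <= 1.5)  # compare against a hypothetical Rel_const of 1.5
```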

# **Optimization Example: SOC Parameters**

#### 95nm technology

- Five cores; standard workloads (audio, video, www etc.)
- MTTF constraint set to 10 years; minimized power consumption



| IP block                    | P<sub>active</sub> [W] | P<sub>idle</sub> [W] | P<sub>sleep</sub> [W] | t<sub>ts</sub> [s] | t<sub>ta</sub> [s] |
|-----------------------------|------------------------|----------------------|-----------------------|--------------------|--------------------|
| DSP (TMS6211) [22]          | 1.1                    | 0.5                  | 0.01                  | 250u               | 100n               |
| Video (SAF7113H) [23]       | 0.44                   | N/A                  | 0.07                  | 110m               | 0.9                |
| Audio (SST-Melody-DAA) [24] | 0.11                   | 0.03                 | 3.00E-03              | 6u                 | 0.13               |
| I/O (MSP43011x2) [25]       | 1.00E-03               | N/A                  | 6.00E-06              | 100n               | 6u                 |
| DRAM (Rambus 512M) [26]     | 1.58                   | 0.37                 | 1.00E-02              | 16n                | 16n                |

# **Optimization Results – Single Core**





- Maximum power savings achievable given an MTTF of 10 years are 90% for all cores and temperature ranges, except for DSP, Video and Audio at 90 °C, due to the TC mechanism
- Effect of a design change (widening metal lines) plotted for each failure mechanism
- Current density down by 20%, core area up by 5%, temperature down by 2%, but TC up by 10%

# **Optimization Results - Redundancy**

 Standby off and standby sleep redundancy model power savings with MTTF set to 10 years



[Figure: per-core results for comm., audio, video and DRAM]

System meets MTTF of 10 years when one more redundant core in standby off mode is added to DSP, Audio and I/O; power savings of 40% are achieved

# Summary

- SOCs are rapidly evolving into NOCs
- Reliability is of increasing concern and should be closely correlated with power management
- This work presents an integrated methodology for analysis, optimization and management of reliability and power consumption:
  - Simulator gives fast feedback on topology design and system characteristics for a wide range of operating conditions
  - Optimizer provides a policy capable of giving an optimal implementation of reliability and power management control
- Results obtained for a number of integrated systems implemented in 95nm technology show:
  - Large dependence between power management policy and reliability due to tradeoff between EM, TDDB and TC effects
  - 40% power savings on top of meeting MTTF of 10 years for an integrated system consisting of five cores with redundancy

### **Power Management for NOCs – Related Work**

- SOC interconnect standards [AMBA,CoreConnect,VSI,OCP]
- NOC architecture based on packet model
  - Fat tree router topology [Guerrier00]
  - Tiled architecture with flit-reservation flow control [Dally01]
  - Correct-by-construction protocol stack MESCAL tools [Sgroi01]
  - Circuit and packet switched routing [PhilipsNOC03]
- Reduction of energy consumption in NOCs (for overview see Benini04)
  - Maia processor has 21 satellite units; its configuration changes according to application needs – large energy savings [Wan00]
  - Energy efficient routing [e.g. Worm02, Yoshimura00, Nilsson03]
  - Node and network-centric power management suggested [Benini02]
- Recently proposed power management systems
  - Exclusively node-centric, with little or no outside information utilized
  - Power management & dynamic voltage scaling occur separately
  - Open loop control
    - policies designed once with no further optimization at run time

# **Reliability - Related work**

- Integrated simulation of power and reliability for processor design RAID [Bose et.al.'03]
- Fault-tolerant microarchitectures proposed in [Rotenberg'98]
- Redundancy at the architecture level [Shivakumar'03]
- Thermal management for multimedia [Srinivasan, Adve'03]
- Soft errors addressed by many, for example:
  - Ultra-low power systems [Maheshwari'02]
  - Sensing systems [Marculescu'03]
- Hard failure mechanisms studied at length in the past, e.g.:
  - Temperature cycling [Huang'00]
  - TDDB [Degraeve'98]